Metaphor—A workflow for streamlined assembly and binning of metagenomes

Abstract Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyze genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with their own set of customizable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly—combining the short-input reads into longer, contiguous fragments (contigs)—and binning, clustering these contigs into individual genome bins. The limitations of assembly and binning algorithms also pose different challenges depending on the selected strategy to execute them. Both of these processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step to achieve high-quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets and the impact of available assembly and binning strategies on the final results.


Introduction
Genome-resolved metagenomics (GRM) is a set of techniques for the recovery of genomes from high-throughput sequencing data.
Applications of GRM have led to unprecedented insight into microbial diversity, ecology, and evolution, due to the recovery of (mostly uncultivated) metagenome-assembled genomes (MAGs) [1,2,3,4]. MAGs are essentially "bins" of contigs that are clustered together based on differential coverage and sequence composition; a bin is considered a MAG when it displays a high degree of completeness and a low degree of redundancy/contamination, which is usually calculated through the presence of marker genes in the bin. Advances in GRM have consistently improved the quality of recovered MAGs, and large-scale studies reconstructing and analysing thousands of MAGs have become prominent in microbiology research. Even with the inherent biases that accompany the generation of MAGs, it is evident that the benefits outweigh the risks, and researchers are increasingly in need of automated data processing methods for assembling and binning metagenomes [5]. Data pipelines that perform such experiments are inherently complex, have high computing cost, use heterogeneous data sources, have dozens of customisable parameters, and depend on several specialised bioinformatics software [6,7].
An additional domain-specific challenge for GRM studies is the strategy used for assembling and binning each sequenced sample. Data (raw reads generated by the sequencer) originating from multiple samples may be assembled separately or pooled together, depending whether they come from the same population, specimen, or environment. This results in either a set of contigs for each sample or a 'coassembly' of the pooled samples. Similarly, in the metagenome binning step, where contigs are clustered into genome bins, one may do this individually for each set of assembled contigs, or by pooling together contigs from multiple samples and then mapping each individual sample to this catalogue of contigs ('cobinning') [8]. The latter approach allows binning algorithms to account for differential coverage of contigs across samples, enriching the information available for clustering. The chosen strategy for assembly and binning may have important consequences for the final results, i.e., the quality of the assembly and of the recovered bins [8]. It is hypothesised that pooled assembly and binning may lead to improved results when analysing communities with high genetic diversity, and to poorer results when there is a high level of intraspecies/strain-level diversity [9], Here we present Metaphor, an automated and flexible workflow for the assembly and binning of metagenomes, which recovers prokaryotic genomes from metagenomes efficiently and with high sensitivity, and provides taxonomic and functional abundance data for quantitative metagenome analyses. Our software advances existing metagenomic pipelines by combining two core features: the usage of multiple binning software along with a binning refinement step, and the possibility of defining groups for assembly and binning of samples. This effectively allows scaling Metaphor to process multiple datasets in a single execution, performing assembly and binning in separate batches for each dataset, and avoiding the need for repeated executions with different input datasets. The workflow includes native functionality for downstream integration with 'omics statistical toolkits [10,11], so that abundance data can be easily imported into these tools, and with the Anvi'o [12] platform, which allows importing the collections of bins generated by Metaphor along with contig coverage data. Metaphor generates detailed performance metrics at the end of each module of the workflow to provide users with a high-level summary of their analysis, and has been designed to be user-friendly, portable, and flexible, as users can choose between different strategies for assembly and binning. We demonstrate its functionality using different synthetic datasets and discuss how these different strategies can impact data analyses in terms of quality of the resulting assembly and genome bins.

Design and Implementation
Metaphor stands out from existing GRM pipelines by offering flexible options for assembly and binning combined with multiple binning software and a binning refinement step. See Ta-ble 1 for a comparison of Metaphor's features with other stateof-the-art GRM workflows. The workflow is implemented with Snakemake [13], a widely-used scientific workflow management system. In each module, computing steps (called "rules" by Snakemake) consist of both third-party bioinformatics software [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28] and custom scripts that connect different parts of the workflow, listed on Table 2.
The workflow consists of six modules: quality control (QC), assembly, annotation, mapping, binning, and postprocessing. In the QC module, raw sequencing reads are filtered and trimmed. Metagenomic assembly is then performed. Coding sequences are predicted from the assembled contigs and used for functional and taxonomic annotation. The quality-filtered reads are mapped against the contigs, generating coverage statistics employed by the binning algorithms. After binning is complete, bins are refined and dereplicated. Lastly, the postprocessing module renders runtime and memory usage metrics and generates an HTML report.  The choice of bioinformatics tools was informed by the results of the 2nd Critical Assessment for Metagenome Interpretation (CAMI II) [8,36], striving for the maximum trade-off between performance, efficiency, and software sustainability. Although the latter is a subjective factor, selecting and streamlining dependencies with regard to code quality, maintenance, and community support is a critical factor when maintaining complex bioinformatics pipelines [6,37]. Each third-party software (along with its version) is defined in an individual requirements file that is used by Snakemake to create a virtual environment and run that particular step. To facilitate citing these tools, Metaphor packages a bibs/ directory containing all citations in the Bibtext format.
The workflow takes two files as input: a tab-delimited file con- Table 1. Comparison of features between Metaphor and state-of-the-art GRM workflows as listed by [29]. Data adapted to include Metaphor.

Features
Metaphor

Module
Step Software  taining sample names and file paths to the raw reads, and a configuration file in the YAML format, which will set the workflow parameters (see Fig 1). These files can be automatically generated by Metaphor and edited by the user, or created from scratch. The output of Metaphor consists of a directory for each module, further subdivided into the rules within each module. This is described in detail in the documentation [38].

Assessment on CAMI II synthetic datasets
To demonstrate the functionality of Metaphor, we analysed datasets from CAMI II [8]. All datasets consist of short and long reads generated by simulation of collections of reference genomes [39]). Only short reads were used for each dataset, as Metaphor does not yet support long reads. Specifically, we used the Marine metagenome dataset (identified as 'marmg'), the Strain Madness dataset (identified as 'strmg'), and the Human Microbiome dataset, which consists of five sets of samples, each corresponding to a different sampling location in the human body, which were treated as distinct datasets (3). The following strategies were employed for each datasets: single assembly, single binning ('SASB'), where each sample is individually assembled and binned; single assembly, cobinning ('SACB'), where each sample is assembled individually and then binned with other samples from the same dataset; coassembly, cobinning ('CACB'), where all samples from the dataset were assembled and binned together. Table 4 illustrates how this works in practice, in terms of generated output files. Metaphor allows defining multiple groups for coassembly or cobinning to analyse multiple independent datasets with a single execution.
In order to assess the effect of different assembly strategies, we used MetaQUAST [18] to compare the assemblies generated by the workflow with the collections of reference genomes. For the different binning strategies, we compared metrics obtained from DAS Tool, the software used for dereplicating and evaluating genome bins, after a second round of dereplication with dRep [40]. This is because data generated with the SASB strategy will likely results in redundant bins, as for that strategy there is no dereplication between samples and since samples within a dataset have similar composition, it is likely that a genome bin can be generated repeatedly by different samples. dRep performs dereplication based on the Average Nucleotide Identity between genomes, a metric which has been consistently used as a proxy to differentiate taxonomy at the species and strain levels [41]. dRep was run with default clustering parameters, and without any length, completeness, or contamination cutoffs. We used Spartan [42], the High Performance Computing (HPC) system at The University of Melbourne to run the pipeline. Jobs were dispatched to nodes with the SLURM scheduler, using up to 64 processors and 300 GB RAM per node.

Results and Discussion
After running Metaphor on the CAMI II Marine, Strain Madness and Human Microbiome datasets, we illustrate the different outputs generated by the workflow, and compare the effects of different assembly and binning strategies on workflow performance.

Reconstruction of metagenome-assembled genomes
Metaphor produces genome bins generated with three tools: Vamb, MetaBAT2 and CONCOCT [25,26,27] that are refined with DAS Tool [28]. DAS Tool performs bin refinement through a "dereplication, aggregation and scoring" process, in which candidate bins are initially scored based on the presence/absence of single-copy marker genes (SCGs, which are a proxy for bin completeness). Redundant candidate bin sets are then aggregated and an iterative scoring process is performed, so only the best-quality, non-redundant bins remain; the bin score (S b ) increases with the number of SCGs and decreases with duplicate SCGs per bin. Please refer to [28], Figure 1 and Equation 1 for an overview of the DAS algorithm and the formula to determine the bin score, respectively. The input for each binning tool differs slightly, but they all rely on the catalogue of contigs obtained from the assembly and the coverage files obtained from the read mapping module (see Fig 1). A report is generated for each of the binning groups (only one is generated if cobinning is performed), which highlights three key metrics: completeness, redundancy, and bin score. The first two metrics are calculated by the presence/absence of single-copy genes, and the latter is a function of the former two. Plots generated by an example report are shown in Fig 2. It is possible to compare the performance of the different binning software and obtain the proportion of bins above a specified particular quality threshold based on the bin score. The source table for the report is provided, so that users can generate custom reports and inspect specific individual bins. Bins that pass the quality threshold are stored in individual FASTA files, so they can easily be used for downstream analyses with tools such as CheckM or GTDB-Tk [43,44]. We chose not to include these software in the workflow as they rely on fairly large reference databases and/or contain several different steps that are dependent on third-party software, which would affect Metaphor's portability. Bin collections generated with Metaphor can be imported into the Anvi'o along with coverage data (BAM files), so users can use the interactive interface of Anvi'o to examine the bins.

Contig-level taxonomic and functional profiling
To facilitate quantitative metagenomics applications, Metaphor's annotation module generates contig-level functional and taxonomic profiles based on the NCBI COG database [23]. These are obtained by predicting coding sequences with Prodigal and then aligning the resulting amino acid files with Diamond [21,22] in the "iterative" mode. This setting performs repeated rounds of alignment, with an increasing degree of sensitivity when no hits are detected in the previous round. Abundances for each feature are calculated based on the coverage of all coding sequences which align to that feature. Fig 3 illustrates the profile visualisations offered by Metaphor: a heatmap of COG categories for the functional profile and a stacked barplot for the most abundant taxa (for the latter, one plot is generated for each taxonomic rank). The annotation module outputs count tables with both absolute and relative abundance values of taxa and functional categories, and may be directly imported by downstream statistical toolkits such as MixOmics or PhyloSeq [10,11].

Quality control and performance metrics
Additional outputs produced by Metaphor include the quality control reports from the fastp and FastQC tools, with a summary of FastQC outputs being produced by MultiQC [14,15,16]. A simple report is produced by the assembly module with sequence statistics of the assembled contigs (e.g. N50, number of contigs, total and mean length of contigs), and performance metrics. At the end of the workflow execution, the postprocessing module generates figures obtained from the "benchmark" files provided by Snakemake. These files contain process information such as runtime and memory consumption. Metaphor plots these metrics in two ways: total per rule and per-sample mean (Fig 4) as some rules run only once across all samples, while other rules run per sample. These plots help identify computational bottlenecks and assess whether computing resources are adequate.

Assembly and binning strategies
The effects of distinct assembly and binning strategies on the final output of metagenomic workflows are highly dependent on the data source and research context [8]. As such, the choice of individual or group assembly and binning can only be assessed a posteriori. We compared three different strategies: single assembly and single binning ('SASB'), single assembly and cobinning ('SACB'), and coassembly and cobinning ('CACB'), see Table 4 and Section 'As- sessment on CAMI II synthetic datasets' for details. For assembly, we used the five different groups in the Human Microbiome dataset along with the Strain Madness and Marine datasets. We only used the latter two datasets for the binning assessment.
We used six metrics to evaluate assembly performance: percentage of recovered genome fraction, size of the largest contig, duplication ratio, length of misassembled contigs, number of misassemblies, and number of mismatches per 100 thousand base pairs. High values for the first two metrics and low values for the last four indicate better performance. We observed a general trade-off between assembly completeness (represented by the first two metrics), and the number of errors in the assembly (represented by the last four metrics) ( Figure S1). In most datasets, assemblies were more complete and contiguous, albeit with more errors when the Coassembly strategy was used. The exception was the Strain Madness ('strmg') dataset, for which the Individual assembly was more complete and contiguous, albeit with more errors. This may be attributed to the high degree of strain/intraspecies diversity in that dataset [8]. A high degree of similarity between the related genomes likely confounds assembly algorithms, and pooling samples together may aggravate this effect [5].
To evaluate differences between binning strategies, we compared the number and quality of bins after refinement with DAS Tool. Bins generated with each approach were further dereplicated with dRep [40]. This is because the SASB strategy generates a set of bins for each sample, and datasets with similar composition will likely generate redundant bins, as there is no dereplication of bins between samples. Results varied significantly between the Marine and Strain Madness datasets. In both datasets, the mean bin score was the highest for the CACB strategy ( Figure S3). However, in the Strain Madness dataset, CACB produced a significantly lower number of bins (33 compared with 259 and 215 generated with SASB and SACB), which did not occur in the Marine dataset. The performance of each binning tool is also variable between strategies and is conditional on the characteristics of the original dataset, with no clear "winner", and each tool favouring particular performance metrics, in agreement with results from the 2nd CAMI Challenge [8]. Tools like DAS Tool attempt to conciliate the output of multiple binning algorithms to generate a consensus output which theoretically outperforms each individual algorithm.
Since the binning performance is assessed as a proxy of the combination of quantity and quality of generated bins, rather than only one metric or the other, we calculated the cumulative bin score (the sum of scores of all bins) and the number of bins above an increasing score threshold, shown on Fig 6. The higher the threshold, the more significant the differences between the cumulative scores, as only bins with the highest quality compose the score. For the Marine dataset, we observed a higher score and a larger number of bins in the CACB strategy and the exact opposite in the Strain Madness dataset. In both datasets, there was a clear difference between SASB before dereplication and the other strategies, confirming that several highly similar samples produce redundant bins. That difference was also present in the SACB strategy, albeit not so pronounced (see Figure S4 for the comparison of dereplicated and non-dereplicated data). This suggests that for both of these strategies, further dereplication is recommended [5]. Although the Strain Madness dataset shows fewer bins generated with CACB -a summary of the bins recovered with that dataset is displayed on ??. The cumulative bin score for that strategy remained similar to SACB and SASB above the 0.8 score threshold, since there are fewer bins with a score lower than that. In that same dataset, SASB showed the best performance, although differences were small above the 0.8 threshold. In the Marine dataset, there were more pronounced differences between strategies. CACB produced the larger quantity and higher cumulative score of bins, followed by SASB and SACB.
In summary, our results indicate that, for most metagenomic analysis scenarios, coassembly followed by cobinning is recommended, assuming that samples are sourced from a similar environment or population. The exception to this is when when there is a high level of intraspecies/strain-level diversity across samples, like in the Strain Madness dataset. In that scenario, single assembly followed by single binning is preferred, followed by dereplication of bins between samples. There is, however, a trade-off between the different approaches, as computational requirements are higher for the pooled strategies. Coassembly resulted in higher genome recovery fractions and larger contigs, although usually at the expense of a higher number of misassemblies and higher duplication ratio. When combining coassembly with cobinning, there is a remarkable improvement in the quantity and quality of bins generated for a diverse dataset (represented by the Marine dataset), where the difference was negligible in the Strain Madness dataset. Therefore, when deciding the assembly and binning strategy, it is important to consider the expected strain-level diversity and abundances of each individual genome, as the interaction between these factors is likely to limit the resolution of recovered bins. This is shown in the CAMI II challenge [8] (see Figure 1g); genomes with low strain diversity (i.e. are less than 95% similar with any other genomes) have higher correlation between sequencing coverage and recovered fraction than common genomes (≥ 95% similar to other genomes in the sample), although many times sequencing coverage was not all correlated with genome recovery fraction, specially for smaller bins that represent plasmids or circular elements.

Availability and Future Directions
Metaphor is available through Bioconda [45], a popular repository of bioinformatics software. It can be installed with a single command from the conda package manager [46] or from source using pip, the Python package manager. The installation of all third-party software used by Metaphor is handled automatically by Snakemake and conda. It can be easily deployed in different computing environments, such as high performance computing clusters and cloud instances, due to Snakemake's support of execution profiles. Metaphor is developed with documented best practices in workflow development [6,47], striving for reproducibility and transparency of its results. Data used for the testing Metaphor's installation (see documentation for details) is available from GitHub at https://github.com/vinisalazar/mg-example-data. This data is a subset of the CAMI I challenge data [36] that is reduced in size in order to run test commands in a reasonable time.
The workflow may be extended to support downstream tools such for genome analysis such as GTDB-Tk, CheckM, and dRep. This may help with further improvement of strain-level resolution in bins; there are a number of strategies for that, such as identification of misassembled contigs or using the assembly graph for variant detection [48,49]. New functionality may also be added for the identification of eukaryotic and viral contigs; Metaphor would benefit from new third-party software to facilitate the generation of non-prokaryotic bins in the near future. The output of Metaphor's 'annotation' module is suitable for ad hoc identification of eukaryotic and viral contigs; after selecting the annotated prokaryotic contigs, it is possible to filter them out, leaving unannotated (putative) eukaryotic and viral contigs. These can then be used as input for a eukaryotic or viral discovery pipeline [50,51,52], but this process could be further improved by facilitating the use of custom reference databases in the annotation module. This can also be done directly with the output of the assembly module, but in that case there won't be any screening for prokaryotic contigs. One drawback of this approach is that each eukaryotic/viral discovery pipeline has specific input data formatting requirements. This integration with non-prokaryotic pipelines, along with support for long reads, are priority features to be added to future major versions of Metaphor.

Data availability
This work uses data from the 2nd CAMI Challenge, available from [53,54,55]. Analysis code used in this manuscript is available from [56] Snapshots of our code and other data further supporting this work are openly available from the GigaDB repository [57].

Declarations
The authors declare they have no competing interests.   Each data point is a genome bin, and Y-axis depicts bin scores from 0 to 1. Columns separate datasets, and colours represent different strategies. Numbers underneath each bar show the number of data points for that bar. Bins sets were dereplicated with dRep. Figure S3. Boxplot of bin scores across different strategies for nondereplicated data. Same as Figure S2, but with non-dereplicated data. Each data point is a genome bin, and Y-axis depicts bin scores from 0 to 1. Columns separate datasets, and colours represent different strategies. Numbers underneath each bar show the number of data points for that bar.